Rapid taxonomic categorization of short, abundant virus sequences for ecological analyses

Abstract Public health concerns about recent viral epidemics have motivated researchers to seek novel ways to understand pathogen infection in native, wildlife hosts. With its deep history of tools and perspectives for understanding the abundance and distribution of organisms, ecology can shed new light on viral infection dynamics. However, datasets allowing deep explorations of viral communities from an ecological perspective are lacking. We sampled 1086 bats from two, adjacent Puerto Rican caves and tested them for infection by herpesviruses, resulting in 3131 short, viral sequences. Using percent identity of nucleotides and a machine learning algorithm (affinity propagation), we categorized herpesviruses into 43 operational taxonomic units (OTUs) to be used in place of species in subsequent ecological analyses. Herpesvirus metacommunities demonstrated long‐tailed rank frequency distributions at all analyzed levels of host organization (i.e., individual, population, and community). Although 13 herpesvirus OTUs were detected in more than one host species, OTUs generally exhibited host specificity by infecting a single core host species at a significantly higher prevalence than in all satellite species combined. We describe the natural history of herpesvirus metacommunities in Puerto Rican bats and suggest that viruses follow the general law that communities comprise few common and many rare species. To guide future efforts in the field of viral ecology, hypotheses are presented regarding mechanisms that contribute to these patterns.


| INTRODUC TI ON
Ecology has a long history of quantifying the abundance and distribution of organisms in nature, and in recent years has been increasingly focused on the study of viruses (Anthony et al., 2015(Anthony et al., , 2017;;Diaz-Munoz, 2017;Gao et al., 2022;Holmes, 2022;Jokinen et al., 2023;Malmstrom et al., 2011;Olayemi & Fichet-Calvet, 2020;Plowright et al., 2015).These efforts have mostly centered on relationships between single host species and single pathogen species, adopting a reductionist approach and studying such systems in isolation.Although this monothetic methodology has been successful in the control of known diseases (e.g., rabies: Benavides et al., 2020;Mshelbwala et al., 2016;Reynolds et al., 2015;Hendra: Eby et al., 2020;Plowright et al., 2008;Plowright et al., 2011;Marburg: Amman et al., 2012;Amman et al., 2020;Towner et al., 2009), it likely falls short of capturing the full complexity and scope of interactions that contribute to ecological patterns.Infected host individuals often contain multiple viruses and transmit them among cohabitating host species.Infection with one virus can affect infection, virulence, or disease emergence of others (Anthony et al., 2015;Seabloom et al., 2015;Telfer et al., 2010).Thus, it is essential to better understand the role of community-level processes and their potential influence on viral abundance and distribution.
Several challenges exist when integrating community ecology with virology.First, most wildlife viruses are unknown to science (Anthony et al., 2013(Anthony et al., , 2017;;Carlson et al., 2019), and it is difficult to study the ecological processes operating within viral communities without first sorting viruses into species or equivalent units.Additionally, most viruses are rare (i.e., exist at low prevalence; Anthony et al., 2015;Tang et al., 2006;Wacharapluesadee et al., 2010;Winker et al., 2008), and many do not persist in host individuals for long periods of time (Henaux & Samuel, 2011;King et al., 2012;Klenk et al., 2004;Lee et al., 2009;Vetter et al., 2016).
Thus, the probability of infection by any one virus at any particular time is low.Furthermore, large datasets of comprehensively sampled host populations are needed to address high levels of uncertainty associated with low viral prevalence.Such datasets are atypical and generally consist of partial, and often only small (e.g., <300 base pairs [bp]) genome sequences.Finally, the process of defining official virus species according to the International Committee on the Taxonomy of Viruses (ICTV) is thorough but also complex and time consuming.
Proactively analyzing and identifying ecological patterns for developing public health threats, however, require fast, functional, and biologically relevant viral classification to support scientific study until the more systematic ICTV classifications can transpire.Fortunately, many analyses of community ecology, specifically, do not require identities of species (e.g., the species abundance distribution, empirical cumulative distribution; Fisher et al., 1943;MacArthur, 1957;McGill et al., 2007;Preston, 1960;Raunkiaer, 1909), so alternative classification schemes can be used to show macro-ecological patterns or compositional trends.For viruses, such schemes include delineating operational taxonomic units (OTUs) using monophyletic groups (Anthony et al., 2015) or pairwise distances (Lauber & Gorbalenya 2012;Maes et al., 2009;Muhire et al., 2014), but these methods rely on partially subjective cutoffs or full-length sequences to differentiate OTUs.A new, more objective method is needed that utilizes highly abundant and readily available short sequence data and allows timely completion of ecological analyses.
We address some of the challenges associated with studying the ecological processes that shape viral communities using imperfect but comparatively easily gathered data.We use herpesviruses as a model taxon because they generally occur at high prevalence (Cortez et al., 2008;Imbronito et al., 2008;Phalen et al., 2017;Tazikeh et al., 2019) and establish latent, long-term infections, often for the entirety of the host's life (King et al., 2012).They can also be detected easily using consensus polymerase chain reaction (cPCR; Van Devanter et al., 1996;Wibbelt et al., 2007;Wray et al., 2016) and frequently co-infect the same host individual (Azab et al., 2019;Elhassan et al., 2016;Seimon et al., 2015).
Herpesviruses are large (~300,000 bp), enveloped, doublestranded DNA viruses that infect all vertebrate classes and mollusks (Azab et al., 2018;King et al., 2012).They were first detected in bats in 1996 (Tandler, 1996) and have since been reported in many bat species (Holz et al., 2018;Pozo et al., 2016;Razafindratsimandresy et al., 2009;Sasaki et al., 2014;Wada et al., 2018;Wibbelt et al., 2007;Zhang et al., 2012;Zheng et al., 2016).Bats are even suggested to have played a critical role in the evolution and diversification of New World herpesviruses (Escalera-Zamudio et al., 2016), making them an important host taxon for the study of herpesvirus ecology more broadly.Despite recent advances, much about bat herpesviruses remains unknown, including the diversity of herpesviruses in different bat species, the degree to which bat herpesviruses are host specific, and the factors that contribute to viral sharing between hosts.
Multiple bat species come into close proximity within caves where they roost, presenting opportunities for cross-species transmission (Simonis & Becker, 2023).Indeed, cross-species transmission of herpesviruses and host switching are common in bat hosts and have likely been the norm throughout the evolutionary history of these viruses (Azab et al., 2018;Escalera-Zamudio et al., 2016;Wada et al., 2018;Zheng et al., 2016).Cross-species transmission is particularly evident within the subfamily Gammaherpesvirinae, as its constituent species show less strict host-pathogen co-phylogeny with bat hosts than do viral species within the Alphaherpesvirinae or Betaherpesvirinae (Azab et al., 2018;Escalera-Zamudio et al., 2016;Wada et al., 2018;Zheng et al., 2016).The ecological or genetic reasons underlying differences in virus sharing between host species are unknown.
The goal of this study was to implement a quantitative, objective approach to determine viral units from a large and relatively easily collected sequence dataset and then use the units to (1) estimate total herpesvirus diversity, (2) describe patterns of community distribution and co-infection, and (3) examine host-specificity and spillover for herpesviruses in cave-dwelling bats of Puerto Rico.Because most viruses are rare (Young & Olival, 2016), and because hyperbolic species abundance distributions (SADs) represent a general ecological law (McGill et al., 2007), we predicted that herpesvirus metacommunities would comprise few common and many rare viruses.Well-established understanding of strong host-specificity in herpesviruses, coupled with recent insights highlighting host sharing in bats as an important eco-evolutionary mechanism shaping herpesvirus diversity more broadly (Azab et al., 2018;Escalera-Zamudio et al., 2016;Wada et al., 2018;Zheng et al., 2016), suggest that (1) closely related hosts regularly share herpesviruses on an ecological timescale and (2) patterns of specificity (e.g., high infection prevalence in a single host species) develop over longer, evolutionary time periods.We therefore hypothesized that we would see instances of herpesvirus spillover among bat species, but that virus OTUs would show a strong preference for one, primary host species.In these spillover recipients, viruses would exist at a lower prevalence, because the viruses are not as strongly adapted to secondary species.
Finally, in concert with common ecological occupancy-abundance patterns seen in macro-organisms (Gaston et al., 2000;Gaston & He, 2011;He & Gaston, 2003), we expect that viruses shared across multiple host species will be found at the highest prevalence in the bat community.
Culebrones Cave is structurally complex and hot, with temperatures reaching 40°C and relative humidity at 100%.It is home to an estimated 158,000 individual bats representing six species: Pteronotus quadridens, P. portoricensis, Mormoops blainvillei, Monophyllus redmani, Erophylla bombifrons, and Brachyphylla cavernarum (Rodríguez-Durán et al., 2023).Bats were sampled for 28 nights between June and August 2017.A harp trap was placed at sunset (approximately 18:00) immediately outside of the opening and monitored continually.Larvas Cave is cooler (ambient temperature), smaller, and less structurally complex compared to Culebrones Cave.It hosts a small total number of bats (30-200; ARS personal observation) representing two species, Artibeus jamaicensis and Eptesicus fuscus.Bats were sampled from Larvas Cave on seven instances between June and August 2017, using two different techniques.After sunset, mist nets were placed along a trail outside of the cave entrance to catch exiting bats and were checked at least every ten minutes.Hand nets were also used to capture bats roosting inside the cave.
Regardless of sampling method, each captured bat was placed into a porous cotton holding bag.Individuals were identified to species following Gannon et al. (2005)

| Viral screening
Total nucleic acids were extracted from each swab (one swab per bat; 1086 total) using the EasyMag platform (bioMerieux, Inc.), and cDNA was synthesized using SuperScript III first-strand supermix (Invitrogen).Synthesis of cDNA without DNAse treatment allows for the detection of both viral mRNA and genomic DNA and thus maximizes the likelihood of detection.Nested consensus polymerase chain reaction (cPCR) was performed twice on each sample, targeting a region of the highly conserved catalytic subunit of the DNA polymerase gene (Van Devanter et al., 1996).cPCR is a broadly reactive method that allows detection of both known and novel (i.e., genetic material not yet publicly available) viruses by using degenerate primers targeting viral sequences that are conserved at the level of virus family or genus (Anthony et al., 2013(Anthony et al., , 2015;;Van Devanter et al., 1996).The assay was performed twice to increase chances of herpesvirus detection, thereby partially accounting for imperfect detection associated with this method.
The cPCR protocol was modified slightly from that of Van Devanter (1996) to include the use of the QIAGEN Fast Cycling PCR kit: 95°C for 5 min, then 45 cycles of 96°C for 8 s and 68°C for 12 s, finished with 72°C for 2 min.Primer and dimethyl sulfoxide (DMSO) concentrations were identical to that of Van Devanter et al. (1996).
A constructed synthetic plasmid was used as a positive control to confirm effective implementation of the assay and to detect possible contamination.cPCR reactions were resolved on 1% agarose.
Products of the expected size (~225 bp) were excised, purified using Ultrafree-MC Centrifugal Filter (Millipore), and cloned into Strataclone PCR cloning vector.Twelve white colonies were then sequenced using conventional Sanger sequencing.Based on our previous experience, sequencing more than 12 colonies rarely resulted in the detection of additional genetic diversity.However, rare viruses present at very low copy number may have been missed.As rare viruses presumably contribute the least to community assembly processes, we suggest this limitation should not impact our overall findings.All sequences (clones) were cross-referenced against the GenBank nucleotide database using Blastn to confirm that they were herpesviruses.

| OTU classification
Empirical sequences were aligned using ClustalW, executed in Geneious version 11.0.5 (https:// www.genei ous.com).The alignment was trimmed to remove primer sequences on the 5′ and 3′ ends, resulting in an alignment of ~178 bp (Figure 1, Step 1).A percent identity (PID) matrix was calculated based on percent similarity of nucleotides in each sequence pair (Figure 1, Step 2), and a PID histogram was generated (Maes et al., 2009;Figure 1, Step 3a).
A two-step approach was then used to define OTUs using the PID matrix (Figure 1, Step 3a and 3b).First, the number of OTUs was estimated using the two visible PID cutoffs, the values of which are shown as breaks (or troughs) in the PID histogram (Maes et al., 2009; Step 1 also includes trimming GenBank sequences to the same ~178 base pair region as is sequenced for the empirical samples.In Step 2, percent identity (PID) matrices are calculated separately for each alignment.Each PID matrix is then used to create PID histograms (Step 3a) and run affinity propagation (Step 3b).Step 3b (affinity propagation) was repeated 25 times.The results of Steps 3a and 3b were used in tandem to determine the number of OTUs (i.e.virus species) in the alignment (Step 4)."Polymerase" is abbreviated as Pol, and "sequence(s)" is abbreviated as seq(s).(Frey & Dueck, 2007).In the case of viral sequences, exemplars are single sequences that are the best representatives (akin to type sequences) of an entire cluster, and one exemplar exists per cluster.
The remaining sequences are then clustered around the exemplars (see Appendix and Frey & Dueck, 2007).
When inputting the PID matrix, the number of clusters identified through affinity propagation can be interpreted as an optimal number of viral OTUs.Affinity propagation was used in place of other clustering algorithms (e.g., k-means) because the number of clusters (i.e., OTUs) was not a required, a priori input parameter.
Although affinity propagation has been used to define clusters of a single viral species (Fischer et al., 2017), this is the first instance in which it has been used to delineate viral units for the purpose of community-level analyses.To conduct affinity propagation analyses (Bodenhofer et al., 2011), the apcluster() command was run with a single line of code, inputting the PID matrix and using default values of input parameters (input data = NA, input preference = NA, q = 0.5, convits = 100, maxits = 1000, lam = 0.9).The results of PID histogram cutoffs and affinity propagation were then used in tandem to classify OTUs (Figure 1, Step 4, see Section 3).
To further explore the utility of PID histograms and affinity propagation as a two-step means to define appropriate viral OTUs, we gathered from GenBank full-length DNA polymerase sequences (~ 3000 bp each, representing about 1% of the full herpesvirus genome) for each of 16 established herpesvirus species, as ratified by the ICTV (King et al., 2012; Table A1).Additionally, to explore the method's sensitivity to sequence length, we trimmed the full-length polymerase sequences of ratified viral species to the same 178 bp region represented by the samples collected in Puerto Rico.We then performed the two-step method for defining OTUs on these two sets (i.e., full-length and trimmed) of established species.Further details are included in the Appendix.For ease of exposition, "species" will be used hereafter to reference host taxa and formally described viral taxa, whereas "OTU" will be used to reference an empirical viral taxonomic unit detected in Puerto Rican bats.
To compare OTUs with established herpesvirus species, a maximum likelihood (ML) cladogram was generated.Exemplar sequences, as defined using affinity propagation, were chosen to represent each OTU.Sequences were aligned using ClustalW and executed in Geneious version 11.0.5.ML trees were constructed in Mega7 (Kumar et al., 2018).For the beta subfamily, the best model was Kimura two-parameter + G + I, and for the gamma subfamily the best model was Hasegawa-Kishino-Yano + G + I (Hasegawa et al., 1985;Kimura, 1980).Node support was assessed using 500 bootstraps.

| Viral discovery curves
To determine how well the collected samples represent the true, imperfectly sampled metacommunity of herpesviruses infecting bats at Mata de Plátano Nature Reserve, viral discovery curves were created using the iNEXT package in R (Hsieh et al., 2016).By defining the metacommunity as all of the herpesviruses infecting all of the host species at Mata de Plátano Nature Reserve, an assumption is made that herpesviruses are shared between caves and among host species (i.e., no host specificity).Because this may not be a valid assumption, separate curves were generated for each bat species, as well as for each cave.Additionally, Chao2 estimates of richness (Chao, 1987) were calculated for the viral metacommunities of all bats at Mata de Plátano Nature Reserve, as well as for each cave and for each bat species separately.

| Rank frequency distributions
Distributions were plotted based on frequency of occurrence (i.e., prevalence; rank frequency distributions, or RFDs) of each OTU in the viral metacommunity at Mata de Plátano, as well as separately for each bat species or cave, and across host individuals.RFDs were used in place of species abundance distributions (SADs) or rank abundance distributions (RADs), as methods did not allow for quantification of viral abundance.Between all pairs of host species and caves, chi-squared goodness-of-fit tests compared the empirical RFDs based on simulated distributions involving 10,000 randomizations (Platefield, 1981) using the chisq.test()command in R.

| Host specificity
For each OTU that infected more than one host species, a host specificity index (S TD *) was calculated (Poulin & Mouillot, 2005) as where i and j represent index values of host species, p represents OTU prevalence within host populations, and ω ij is the phylogenetic distinctiveness of hosts.To measure phylogenetic distinctness, hosts of the same genus are assigned a value of one, hosts of the same family but different genera are assigned a value of two, and hosts of different families have a value of three (Poulin & Mouillot, 2005).S TD * thus weighs host preference by the relatedness of hosts (incorporating evolution) and the prevalence of the pathogen in each of its hosts (incorporating ecology).The extent of viral sharing among hosts species was further examined by determining whether particular OTUs had a primary host species.A primary host species was defined as a single host species for which an OTU had significantly higher prevalence than in all other host species combined.To do this, the prevalence of each OTU in its primary host and its combined prevalence in all other hosts were compared using Fisher's exact test (Wassertheil-Smoller & Smoller, 2015).
To put host specificity of herpesvirus into an ecological context, we next examined host use from an occupancy-abundance perspective.The number of infected host species was plotted against average prevalence across all host species, and across infected host species, for each viral OTU.In this regard, we used OTU frequency of occurrence (i.e., prevalence, or percent of individuals infected in a host population) as a proxy for abundance (e.g., number of individual virions infecting a host population).We then calculated Pearson's product-moment correlation for each relationship.
All analyses were performed using R version 3.4.

| OTU classification
PCR products from each of the 330 positive host individuals were cloned and sequenced.In total, 3131 sequences were generated from the 330 positive host individuals.To represent these sequences as OTUs for ecological analysis, a PID matrix was generated from a total clone alignment of all 3131 sequences.This resulted in two clearly differentiated potential cutoff ranges, as demonstrated by troughs in the histogram (Figure 2).
Two different approaches were used in combination to define OTUs from the PID matrix.In the first step, the number of OTUs were defined based on cutoffs that reflect breaks in the histogram (Maes et al., 2009).Four possible breaks were identified (blue and red vertical lines in Figure 2), indicating that several PID cutoff values could be used to delineate OTUs.One possible PID cutoff value for these data was 90% (solid red vertical line in Figure 2).This would mean that two sequences with less than 90% genetic similarity would be considered to be different OTUs.Conversely, any two sequences that share ≥90% sequence identity would be considered the same OTU.Other potential cutoff values were 74% (solid blue vertical line in Figure 2), 80% (dotted blue vertical line in Figure 2) and 92% (dotted red vertical line in Figure 2).The number of OTUs resulting from the 74%, 80%, 90%, and 92% cutoffs were 29, 31, 42, and 47 OTUs, respectively.
In contrast, affinity propagation resulted in 45 OTUs.This is within the range of OTUs defined using PID cutoffs at the higher PID values (i.e., 42-47 OTUs at 90%-92% similarity), indicating that the higher PID cutoffs are better supported from a quantitative perspective.A cutoff of 90% is also consistent with the genetic distance between established herpesvirus species within the betaherpesvirus and gammaherpesvirus subfamilies.When the GenBank sequences of ratified herpesvirus species were examined, no two TA B L E 1 Summary of capture numbers and virus screening results.betaherpesvirus or gammaherpesvirus species were greater than 90% similar (but see discussion for a single exception).Together, these results suggest that a cutoff of 90% is a quantitatively and biologically supported PID for classifying OTUs.Using the 90% cutoff, 43 OTUs were detected in empirical samples.
To validate the utility of the PID histogram-affinity propagation method, we evaluated how it would perform on 107 sequences representing 16 herpesvirus species ratified by the ICTV (mean = 6.69 sequences per species).The results suggest that affinity propagation is not limited by sequence length while using a conserved genome region, such as the polymerase, but is sensitive to the number of sequences available for each cluster.See Appendix for further details.

| Rank frequency distributions
The herpesvirus metacommunity infecting bats of Mata de Plátano Nature Reserve exhibits a long-tailed distribution with few common and many rare OTUs (Figure 4).When examined at the scales of cave and host population, viral metacommunities showed similarly rightskewed patterns (Figure 4).At the scale of host population, after accounting for experiment-wise error rate, RFDs were significantly different only between Mor. blainvillei and Mon.redmani (p = .002; Table 2).Pairwise comparisons of RFDs between all other species, and between caves, were not significant (Table 2).Because only three OTUs were detected in P. portoricensis, the RFD for this species was not compared to those of other species.Additionally, the RFD for Larva Cave is identical to that of A. jamaicensis, as this was the only host species roosting in Larva Cave that tested positive for any viruses.

| Host specificity
Thirteen OTUs were detected in more than one host species (Table 3), and the remaining 30 OTUs were each found in a single host species.S TD * for each of the shared OTUs ranged from one (infecting two hosts from the same genus) to three (infecting three hosts of two families).Nine OTUs had significantly higher prevalence in one host species than in all other host species combined.
When infection prevalence across all species, regardless of infection status, was compared to the number of hosts infected for each OTU, a significant, positive correlation emerged (r = .544,p < .001; Figure 5a).However, when the relationship was examined across only those species testing positive for each OTU, the correlation was not significant (r = .135,p = .389;Figure 5b).

F I G U R E 2
Percent Identity (PID) histogram of herpesviruses calculated using an alignment of all 3131 clones.The x-axis represents the percent identity of clone sequences compared pairwise.The y-axis represents the frequency of occurrence.Colored, vertical lines represent cutoff points corresponding to two distinct troughs, with blue lines representing a less specific, cutoff (analogous to genus or subfamily) and the red lines representing a more specific cutoff (analogous to species, subspecies, or strain).Solid lines represent the low end of the cutoff ranges, and the dashed lines represent the upper end of the cutoff ranges.Four distinct PID cutoffs are demonstrated (74%, 80%, 90%, and 92%), resulting in four different numbers of OTUs (29,31,42,and 47,respectively).
An ecological community can be defined as two or more taxa sharing the same space at the same time, with the potential for interaction (Begon et al., 1996;Mittelbach & McGill, 2019).Therefore, to study ecological communities, distinct, replicable operational units, or taxa, must be classified.Although community ecology can be studied using various taxonomic levels (e.g., genus; erally required to choose the best cutoff in terms of genetic similarity (i.e., the histogram method) or evolutionary distance (i.e., the monophyletic groups method).Although subjectivity is not required in the case of using genetic divergence (Lauber & Gorbalenya, 2012, Muhire et al., 2014), the methods were developed specifically using full genomes, or the longest possible sequences, and they require use of multiple programs or coding languages.

| Taxonomic categorization
Affinity propagation is a machine-learning algorithm developed to cluster a broad range of data (e.g., images of faces, genes in a  (Frey & Dueck, 2007).The method has been used to cluster strains of rabies viruses (Fischer et al., 2017) but has never been used to identify viral OTUs for analyses in the context of community ecology.Here, we explore the utility of affinity propagation within the context of a molecularly superficial ecological survey of herpesvirus diversity.We integrate affinity propagation with percent identity histograms to classify short viral sequences into robust OTUs that we then use to explore patterns of co-infection and viral sharing among co-roosting host species.
Although this method does not contain the depth and quantity of information that would be available if each OTU went through the ICTV process to define official species, it is a reasonable, quickly implemented method that can be used to reliably and objectively demonstrate ecological patterns until completing further taxonomic work.Additionally, this two-step methodology could be easily applied to any measure of similarity for which a PID matrix can be calculated, including amino acid sequence as well as phylogenetic or functional similarity.
Using a combined approach, we delineated 43 herpesvirus OTUs in 330 bats of seven species.The extra step of affinity propagation used the same data input as the first step of histogram generation, required only a single line of code, and took less than three minutes to implement.Additionally, we showed that OTU delineation was consistent with current taxonomic classifications of known herpesvirus species, suggesting that this combined approach has considerable biological and practical merit.Importantly, we do not suggest this method as replacement for the more in-depth, holistic viral classification conducted by the ICTV.Instead, we show that it can serve as a biologically relevant and quantitatively-supported method to cluster novel viral sequences that have not yet been fully investigated by the ICTV.We recognize that, with time, and through the iterative process of science, the exact species limits of the viruses evaluated here may differ from OTU delineations.
Importantly, we do not advocate for affinity propagation as a "stand-alone" method for clustering viral sequences into OTUs.
Rather, we suggest using it to provide an objective, quantitative support for delineating clusters based on a chosen comparative metric, such as genetic or phylogenetic relatedness.Here we combined affinity propagation with PID histograms and note that our ability to define cutoffs would be made more difficult if discrete troughs were not observed in the PID histograms.Clear troughs have also been identified in other viral taxa (e.g., Hantaviruses; Maes et al., 2009), showing that there would be utility to our combined approach beyond this study.Affinity propagation could also be easily applied alongside monophyletic grouping when robust phylogenies are available.

| Integrating viral ecology with community ecology
Although this methodology, and thus this work, does not identify novel viruses, we show progress in integrating viral ecology with community ecology, as both fields involve focusing on emergent properties of a profile of taxonomic units, instead of concerning single entities, the identity and function of which may be unknown (Jurburg et al., 2022;McGill et al., 2007).Microbial ecologists facing a similar lack of specific bacterial identities have advanced their field by emphasizing compositional data, relative analyses such as ratios, and global associations between community structure and phenotypes of interest (Gloor et al., 2016;Jurburg et al., 2022;Pawlowsky-Glahn et al., 2015;Rivera-Pinto et al., 2018).Therefore, we are confident that descriptions of herpesvirus communities that lack properly identified species can still provide foundational data to inform the nascent field of viral community ecology.
Species accumulation curves for herpesviruses show that the sequence diversity detected in this study is incomplete.At the largest scale (i.e., the entire bat community), sampling completeness reached 80%.This is in contrast to accumulation curves for viral metacommunities infecting individual host populations, which were as low as 55% (P.quadridens).This suggests that undiscovered OTUs at the scale of host population have already been discovered at the scale of host community (i.e., have already been discovered in another host species).This supports the hypothesis that herpesviruses are shared among host species, and that certain OTUs that are common in a primary host are rare in all other host species combined.

| Structuring mechanisms
One mechanism that could be at play in shaping the long-tailed RFD for herpesviruses is temporal turnover of OTUs resulting from viral latency.Although the lack of temporal turnover of OTUs is a negligible factor in herpesvirus detection efforts, it is likely still at play at a timescale coincident with viral community assembly.
Studies of free-living organisms may attribute temporal turnover to immigration or local extinction (MacArthur & Wilson, 1967), but in this system, temporal turnover also arises from changes in viral activity.Sampling methods were non-invasive-detected viruses represented only those being actively shed by bats at the time of sampling.They did not account for infections in their latent phase.
Latency and lifelong infection could also contribute to maintenance of rare OTUs, as extreme longevity may allow currently rare viruses to be common at other times (Wisnoski et al., 2019), as has been documented in some plants (Magurran & Henderson, 2011).
Additionally, dormancy, a similar process as latency in its reversible state of reduced metabolic activity, has shown strong effects on microbial diversity and distributions (Lennon & Jones, 2011;Locey, 2010;Locey et al., 2020), providing support for latency as a mechanism that contributes to the form of RFDs in herpesvirus communities.
Viral infection of dead-end hosts could be another mechanism that shapes long-tailed RFDs in herpesviruses.These are instances in which a pathogen can spill over and infect a new host species but cannot further transmit from this host.Consequently, the pathogen does not establish positive population growth or emergence in the new host species.This could account for single occurrence OTUs in a host population and is analogous to a free-living organism captured in a sub-optimal habitat through which it passes without establishing.Comparisons of OTU prevalence between primary and secondary hosts shows that most herpesvirus OTUs have a core host, TA B L E 3 Host specificity index (S TD *) for each of the 13 herpesvirus OTUs that infect more than one host species.1) may be more strongly shaped by positive interactions among OTUs, with viruses facilitating co-infection (Poulin, 2007).Conversely, the less rich community in Mor.blainvillei (S = 9, Chao2 estimate = 13.485;

Viral
Table 1), may dominated by competitive interactions among OTUs (Poulin, 2007) or priority effects through activation of the host immune system.Although significant differences did not exist between most species, this could reflect under-sampling of populations.Future studies should investigate viral interactions and host heterogeneity within the context of RFDs or SADs to examine these hypotheses.

| Host specificity
Our primary hypothesis was that herpesviruses would show significant associations with particular bat species.We further hypothesized that, despite associations with a primary host species, certain viruses would also be identified in co-roosting species of bats due to the proximity of host individuals and the opportunity for viral spillover within caves.We found that nine of 13 multihost OTUs in our study had significantly higher prevalence in a single host species than in all other hosts in which it occurred.
This shows that in this system, herpesviruses are indeed associated with a single, primary host species and that cross-species transmission occurs regularly.Such regular viral sharing has implications for long-term viability of the viruses, as it facilitates the establishment of new host-virus associations after changes or disturbances to host ecology (Brooks & Ferrao, 2005;Brooks & Hoberg, 2007;Hoberg & Brooks, 2008).This is particularly important in places such as Puerto Rico, where hurricane disturbance is a defining feature of the ecosystem.In Culebrones cave, specifically, hurricane Georges caused shifts in the composition of bat species (Rodríguez-Durán, 2009).Prior to Georges, Ero.bombifrons was the most abundant species in Culebrones, whereas after Georges, at the time of sampling, the community was dominated by P. quadridens, Mor.blainvillei, and Mon.redmani.
Various factors, including the ecology, behavior, and phylogenetic relatedness of hosts, can suggest why viruses may be shared among particular species.For example, nonrandom associations are common among cave dwelling bats in Puerto Rico, most likely relating to eco-physiological characteristics of the bats, patterns of activity, and morphology of caves (Rodríguez-Durán, 1995, 1998). In

| Community structure
The community-level patterns of herpesviruses infecting Puerto Rican bats are non-random, as has been reported for viral communities in other host taxa (e.g., Anthony et al., 2015) and for oral microbiota in this same community of bats (Presley et al., 2021).
Significantly positive abundance-occupancy patterns suggest that these viral communities follow similar assembly rules as do communities of macro-organisms.Positive abundance-occupancy relationships, or the observation that abundant species tend to be widespread while rare species have limited geographic ranges, is one of the oldest documented patterns in macroecology (Brown, 1984;Darwin, 1859;Gaston, 1999;Hanski, 1982).Although the patterns can manifest via either neutral or deterministic processes, and data can support multiple co-occurring mechanisms (Gaston et al., 2000;Verberk et al., 2010;Warren & Gaston, 1997), the results have farreaching implications for ecological applications such as biodiversity monitoring (e.g., how many sites must be visited, or hosts must be sampled, to comprehensively describe the ecological community) and expanding habitat (or host) use (Gaston, 1999).

| Limitations
Many theories and rules of community ecology and evolution, including those developed to interpret occupancy-abundance patterns, require counting both species (i.e., OTUs) and individuals (i.e., abundance; Brown, 1995;Lawton, 1999).However, the number of virions, or viral particles, which can be thought of as virus "individuals", cannot be counted as can distinct, macro-organismal entities (Burrell et al., 2017;Heider & Metzner, 2014).As a result, viral ecology often relies on prevalence, as demonstrated here, and relying on prevalence assumes that viral load is equal in each sampled host.
This remains a major criticism of the field of micro-ecology (Shade et al., 2018) et al., 2015).Sequencing the full genome of constituent viruses, or at least the complete coding sequence for multiple genes under different evolutionary constraints, will provide invaluable data for future analyses, but the work presented here demonstrates that such data need not limit the investigation of all questions regarding virus ecology.We find this demonstration to be particularly useful, as one of the most comprehensively sampled datasets of global viral diversity was recently made public, the majority of sequences from which were only available from short, cPCR fragments (PREDICT Consortium, 2020, https:// data.usaid. gov).This new body of data is ripe for catalyzing a revolution of thought, a fundamental aspect of scientific discovery (Goldenfeld & Woese, 2007), and we are only just scraping the surface of how such sequence data can guide future understanding.Thus, finding ways to utilize cPCR data in ecological studies is important for advancing the field, exploiting currently available datasets, and deepening the understanding of viruses from an ecological perspective.
Despite the limitation of short sequences, our methods for delineating herpesvirus sequences into OTUs are reasonably robust.
Although the goal of our study was not to define viral species, affinity propagation clustering correctly differentiated most established herpesvirus species, even when polymerase sequences were trimmed to only ~178 base pairs.However, this was only true if the number of sequences used per species was >2.The only two species that could not be differentiated (EHV-1 and EHV-9) share >97% sequence identity in the polymerase gene and 86-95% identity genome-wide (Fukushi et al., 2012).Thus, although the short ~175 bp fragment is sufficient to distinguish most herpesviruses (at least, known herpesviruses), it is not able to separate viruses that are closely related in the polymerase gene but divergent in other parts of the genome.It is important that our results be interpreted in light of this limitation, as it is possible that some herpesvirus sequences assigned to the same OTU actually represent different viruses based on other genomic regions.
A final limitation of our study is that our methods do not detect herpesviruses that latently infect tissues other than oral epithelia.
Latent viruses are, by definition, not actively reproducing, and therefore presumably do not interact with actively replicating viruses, at least not directly.However, indirect interactions of latent viruses via the immune system may have priority effects on community assembly and consequently play an important role.Additional studies are required to clarify the role of latency on community assembly of herpesviruses.

| CON CLUS IONS
This study advances the field of viral ecology in two ways.First, it presents a novel, two-step approach to clustering viral sequences using data that are imperfect but also most readily available.In so doing, it facilitates community-level analyses of the factors shaping viral communities beyond its narrow application in this study system.Second, this work applies the two-step method to investigate the drivers of herpesvirus diversity in cave-dwelling bat communities, showing that these viruses have strong hostassociations and are likely structured, to some extent, via nonrandom processes.We suggest additional hypotheses that, to the extent of which they can be addressed in future work, represent an opportunity to uncover rules of life that transcend taxonomic silos to unite the fields of viral, microbial, and macrobial ecology.
Moreover, studies of the ecology of a taxon that is generally rare and episodically abundant may provide insights that are relevant beyond virus-host systems.

F I G U R E A 2
Cladogram for the Gammaherpesvirinae (GHV), generated using the GKY + G + I model.Hosts of previously discovered viruses are coded blue for Old World, gray for New World, and black for global (i.e.domesticated) distribution.Purple stars indicate human viruses (HHV), and only those sequences that have been officially classified into genera ratified by the ICTV are highlighted in color.GenBank accession numbers for these sequences are listed in parentheses.All sequences beginning in "OTU" were sampled from Puerto Rican bats in this study, with the following four-and two-number codes representing sample ID and clone number (01-12), respectively.Sequences chosen to represent OTUs were all exemplars of their respective clusters.The cladogram was generated using a short, 169 base pair alignment, so any inferences should be made with caution.Sequence references for bat hosts: Wibbelt et al. (2007), Shabman et al. (2016), Wada et al. (2018), andJames et al. (2020).

Figure 1 ,
Figure 1, Step 3a).Because multiple troughs exist in the histogram, and to remove the subjectivity involved in determining which trough to use as the cutoff, a second step was introduced to objectively support the use of one cutoff, as opposed to another.In this step, the number of OTUs was determined using affinity propagation via the R package apcluster (Bodenhofer et al., 2011; Figure 1, Step 3b).Affinity propagation is a clustering algorithm that examines all cells in a similarity matrix (e.g., in a PID matrix) to identify exemplars OTU richness per host population ranged from three to 15.For comparison, humans are the only host species for which comprehensive data are available on herpesvirus richness, and only eight known herpesvirus species infect people, despite the cosmopolitan distribution of humans.OTU richness per host individual ranged from one to four.Seventy-eight individuals were coinfected with more than one OTU: 67 individuals had two OTUs, 10 individuals had three OTUs, and one individual had four OTUs.Cladograms were generated using maximum likelihood for the Beta and Gamma subfamilies and had overall low bootstrap support (FiguresA1 and A2, respectively).For both subfamilies, OTUs were generally most closely related to other viruses with bat hosts, although caution should be exercised when interpreting relationships due to the short, ~175 bp sequence lengths and the low bootstrap support.In this context, the cladograms should be used to illustrate degree of genetic separation between lineages, rather than the accurate reconstruction of shared ancestors.

F I G U R E 4
Rank frequency distributions for all bats, for Culebrones cave, and for each individual host species.OTUs are listed on the x-axis ranked from most frequent (i.e., common) to least frequent (i.e., rare).The y-axis represents the frequency of occurrence (i.e., prevalence), or the number of hosts in which the OTU was detected.OTUs with frequencies of occurrence at or above the red lines (1/S Total ) represent common OTUs, whereas those below the red lines represent rare OTUs.Total hosts sampled (N), number of OTUs detected (S), and number of common (S common ) and rare (S rare ) are shown for each viral community.A. jamaicensis represents the only species not found in Culebrones Cave, as demonstrated by the *.

Host species N N positive Prevalence S (% completeness) N common, N rare Chao2 (95% lower, upper)
Note: Numbers of host individuals sampled (N), number testing positive for herpesvirus (N positive), and prevalence of herpesviruses are calculated for each host species and for each cave.Number of herpesvirus OTUs detected empirically (S) and estimate of herpesvirus OTU richness (Chao2), are compared to estimate the percentage of herpesvirus OTUs that was detected in this study (% completeness).No viruses were detected from Eptesicus fuscus, so NA values are produced.
Keeping the experiment-wise error rate at .05, individual p-values for each pairwise comparison of host populations are considered significant, and bolded, at a value of less than or equal to .003.
(McGill et al., 2007;Poulin et al., 2008)ruses may follow well-established ecological "laws" that are characteristic of freeliving and macroparasitic organisms(McGill et al., 2007;Poulin et al., 2008), supporting the use of ecological perspectives to better understand viral dynamics.

sub-family S TD *
, as the extent to which theories based on abundance can be applied to systems for which only prevalence is measured is